Exploratory Data Analysis by Allan Visochek

Using the Prosper Loan Company Dataset

06-01-20015

Load libraries and data

Univariate Plots Section

BorrowerRate, LoanOriginalAmount & Term

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

BorrowerRate Ranges from 0 to .05

##     10% 
## 0.09886
##    90% 
## 0.3099

Most BorrowerRates are between .1 and .31.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

all loans are between 0 and 35,000…

75% of loans are under 12,000…

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

All loans have a term of either 1,3 or 5 years.

Borrower Info

##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

The large majority of loans go to individuals who are employed full time.

IsBorrowerHomeowner:

## False  True 
## 56459 57478

about half and half

##          AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA 
##  5515   200  1679   855  1901 14717  2210  1627   382   300  6720  5008 
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
##   409   186   599  5921  2078  1062   983   954  2242  2821   101  3593 
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY 
##  2318  2615   787   330  3084    52   674   551  3097   472  1090  6729 
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT 
##  4197   971  1817  2972   435  1122   189  1737  6842   877  3278   207 
##    WA    WI    WV    WY 
##  3048  1842   391   150

pretty even distribution by state population

##                                                        Accountant/CPA 
##                               3588                               3233 
##           Administrative Assistant                            Analyst 
##                               3688                               3602 
##                          Architect                           Attorney 
##                                213                               1046 
##                          Biologist                         Bus Driver 
##                                125                                316 
##                         Car Dealer                            Chemist 
##                                180                                145 
##                      Civil Service                             Clergy 
##                               1457                                196 
##                           Clerical                Computer Programmer 
##                               3164                               4478 
##                       Construction                            Dentist 
##                               1790                                 68 
##                             Doctor                Engineer - Chemical 
##                                494                                225 
##              Engineer - Electrical              Engineer - Mechanical 
##                               1125                               1406 
##                          Executive                            Fireman 
##                               4311                                422 
##                   Flight Attendant                       Food Service 
##                                123                               1123 
##            Food Service Management                          Homemaker 
##                               1239                                120 
##                           Investor                              Judge 
##                                214                                 22 
##                            Laborer                        Landscaping 
##                               1595                                236 
##                 Medical Technician                  Military Enlisted 
##                               1117                               1272 
##                   Military Officer                        Nurse (LPN) 
##                                346                                492 
##                         Nurse (RN)                       Nurse's Aide 
##                               2489                                491 
##                              Other                         Pharmacist 
##                              28617                                257 
##         Pilot - Private/Commercial  Police Officer/Correction Officer 
##                                199                               1578 
##                     Postal Service                          Principal 
##                                627                                312 
##                       Professional                          Professor 
##                              13628                                557 
##                       Psychologist                            Realtor 
##                                145                                543 
##                          Religious                  Retail Management 
##                                124                               2602 
##                 Sales - Commission                     Sales - Retail 
##                               3446                               2797 
##                          Scientist                      Skilled Labor 
##                                372                               2746 
##                      Social Worker         Student - College Freshman 
##                                741                                 41 
## Student - College Graduate Student           Student - College Junior 
##                                245                                112 
##           Student - College Senior        Student - College Sophomore 
##                                188                                 69 
##        Student - Community College         Student - Technical School 
##                                 28                                 16 
##                            Teacher                     Teacher's Aide 
##                               3759                                276 
##              Tradesman - Carpenter            Tradesman - Electrician 
##                                120                                477 
##               Tradesman - Mechanic                Tradesman - Plumber 
##                                951                                102 
##                       Truck Driver                    Waiter/Waitress 
##                               1675                                436

Credit and Ratings

##           A    AA     B     C     D     E    HR 
## 29084 14551  5372 15581 18345 14274  9795  6935
##           A    AA     B     C     D     E    HR    NC 
## 84984  3315  3509  4389  5649  5153  3289  3508   141

A large number of Borrowers either don’t have a prosper rating or a credit score…

lets create new variables to indicate weather a user has a credit grade or a prosper rating

I’m also going to switch the order so that AA comes before A

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591

Most borrowers have a credit score between 600 and 800

Income

##             $0      $100,000+      $1-24,999 $25,000-49,999 $50,000-74,999 
##            621          17337           7274          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

Few loans are given out to individuals with low income, or who are unemployed, no surprises here.

Let’s reorder IncomeRange and create a new variable to indicate weather a borrower is reported as having an income.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

Debt

There are a few outliers at 10.0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Most debt levels are in the range of 0 to 0.5

Quarter

Lets reorder these by time and try again

Thats better… we can see here that the number of loans given out dropped dramatically in the last quarter of 2008, ….

Univariate Analysis

What is the structure of your dataset?

There are 113,937 loans in the dataset with 81 features, 14 of which were used in the analysis:

Numerical Variables:

BorrowerRate

CreditScoreRangeLower

DebtToIncomeRatio

LoanOriginalAmount

StatedMonthlyIncome

Ordered factor variables:

(from best to worst / greatest to least…)
Term:

60,36,12

(note that term was a numerical variable but was transformed to a factor variable because it has very few values)

CreditGrade:

AA,A,B,C,D,E,HR,NC,none

ProsperRating..Alpha.:

AA,A,B,C,D,E,HR,none

IncomeRange:

$75,000-99,999 ; $50,000-74,999 ; $1-49,999 1-24,999 ; $0 ; Not employed; Not displayed

Unordered factor variables:

EmploymentStatus:

Employed, Full-time, Not employed, Part-time, Retired, Self-employed none,Not available, Other,

Occupation:

Accountant/CPA, Administrative Assistant, Analyst, Architect, Attorney, Biologist, Bus Driver, Car Dealer, Chemist, Civil Service, Clergy, Computer Programmer, Construction, Dentist, Doctor, Engineer - Chemical, Engineer - Electrical, Engineer - Mechanical, Executive, Fireman, Flight Attendant, Food Service, Food Service Management, Homemaker, Investor, Judge, Laborer, Landscaping, Medical Technician, Military Enlisted, Military Officer, Nurse (LPN), Nurse (RN), Nurse’s Aide, Other, Pharmacist, Pilot - Private/Commercial, Police Officer/Correction Officer, Postal Service, Principal, Professional, Professor, Psychologist, Realtor, Religious, Retail Management, Sales - Commission, Sales - Retail, Scientist, Skilled Labor, Social Worker, Student - College Freshman, Student - College Graduate Student, Student - College Junior, Student - College Senior, Student - College Sophomore, Student - Community College, Student - Technical School, Teacher, Teacher’s Aide, Tradesman - Carpenter, Tradesman - Electrician, Tradesman - Mechanic, Tradesman - Plumber, Truck Driver, Waiter/Waitress

BorrowerState:

AK, AL, AR, AZ, CA, CO, CT, DC, DE, FL, GA, HI, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VA, VT, WA, WI, WV, WY

IncomeVerifiable:

TRUE,FALSE

Other observations.

BorrowerRate Ranges from 0 to .05

Most BorrowRates are between .1 and .31

Loans amounts range from 0 to $35,000.

75% of loans are for under $12,000.

All loans have a term of either 1,3 or 5 years.

The large majority of loans go to individuals who are employed full time.

A large number of Borrowers either don’t have a prosper rating or don’t have a credit score.

Few loans are given out to individuals with low income, or who are unemployed.

The number of loans given out dropped dramatically in the last quarter of 2008

What is/are the main feature(s) of interest in your dataset?

The main feature of interest for this investigation is the BorrowerRate.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It is hard to say at this point. All variables mentioned above were selected for the investigation because they are likely to have an impact on the borrower rate.

Did you create any new variables from existing variables in the dataset?

HasCreditGrade <- CreditGrade is available

HasProsperRating <- ProsperRating is available

HasIncome >> IncomeRange is available and greater than 0

Additional variables are created in later sections

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log transformed the following variables, which had a left leaning distribution:

DebtToIncomeRatio

StatedMonthlyIncome

I reordered the following factor variables:

ProsperScore..Alpha,

CreditGrade,

Incomerange,

LoanOrginationQuarter

I also subsetted the data for StatedMonthlyIncome by IncomeVerifiable.

Bivariate Plots Section

Overview

## Term: 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.0929  0.1434  0.1501  0.2064  0.2669 
## -------------------------------------------------------- 
## Term: 36
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1274  0.1815  0.1935  0.2599  0.4975 
## -------------------------------------------------------- 
## Term: 60
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0669  0.1490  0.1870  0.1930  0.2319  0.3304

The average Borrower Rate was decreasing significantly from 2012 to 2014

Loan Amount

The average Loan amount went down steeply in Q4 2008 and rose steadily until 2014.

Debt Level

It doesn’t look like there is a significant relationship from this graph.

It looks like there is a trend, although it is still not quite clear.

Debt level quantile and median BorrowerRate are correlated.

Credit score and ratings

Borrowers without a credit grade have borrower rate that is average relative to the others…

Borrower rates are cleanly distributed among the different prosper ratings. (in contrast to the credit grades which overlap…)

Borrowers without a prosper rating have a median borrower rate that is average relative to the others…

Credit score varies much less among different prosper ratings than it does with credit grades…

Income

(IncomeVerifiable = “True”)

Borrower Info

Unemployed borrowers have a significantly higher Median Borrower Rate than the others

Homeowners have a lower median BorrowerRate than non-homeowners.

The worst States to get a loan are Maine and Indiana. The Best States are Alabama and North Dakota

ggplot(aes(y=BorrowerRate,x=reorder(Occupation, BorrowerRate,FUN=median )),data=loanData)+
  geom_boxplot()+
  coord_flip()

In General, Higher Paying Occupations with a higher level of education (i.e. Judge, Computer Programmer Engineer) have a lower median borrower rate than lower paying occupations with lower levels of education (i.e. Teacher’s Aide, Nurse’s Aide, College Freshman).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The average Borrower Rate was decreasing significantly from 2012 to 2014.

The average Loan amount went down steeply in Q4 2008 and rose steadily until 2014.

The Quantile of DebtToIncomeRatio and Median BorrowerRate are correlated.

Borrowers without a credit grade have borrower rate that is average relative to the others…

BorrowerRates are cleanly distributed among the different prosper ratings. (in contrast to the credit grades which overlap…)

Borrowers without a prosper rating have a median borrower rate that is average relative to the others…

The Median BorrowerRate goes down as income range goes up.

There is a clean and linear relationship between the StatedMonthlyIncome Quantile and the Median BorrowerRate.

Unemployed borrowers have a significantly higher Median BorrowerRate than the others.

Homeowners have a lower median BorrowerRate than non-homeowners.

The worst States to get a loan are Maine and Indiana. The Best States are Alabama and North Dakota.

In General, Higher Paying Occupations with a higher level of education (i.e. Judge, Computer Programmer Engineer) have a lower median borrower rate than lower paying occupations with lower levels of education (i.e. Teacher’s Aide, Nurse’s Aide, College Freshman).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Credit score varies much less among different prosper ratings than it does with credit grades…

Most borrowers who have an IncomeRange Recorded have a StatedMonthlyIncome that falls within this range, however there are quite a few exceptions.

What was the strongest relationship you found?

The strongest Relationships were between the BorrowerRate and CreditGrade, and BorrowerRate and ProsperRating.

Multivariate Plots Section

Overview

The distribution of borrower rates is less even from 2009 through 2012.

This may have to do with the relatively low number of borrowers during that time period.

The average loan amount is closely correlated with the number of loans given out.

The inverse of the BorrowerRate seems to follow the trends in the number of loans and LoanOriginalAmount by about 2 years.

CreditGrade&Ratings

Credit score by Quarter

CreditGrade/ProsperRating

Ok, now I get it… The company used credit grade to determine the borrower rate up until 2009 and then started using their own metric namely, the Prosper Rating

Lets make a new variable to separate the the ealier loans from the latter.

Credit Score

Credit score is correlated to Borrower Rate, although that correlation changes by quarter.

As of 2009, there is a strict cutoff at a credit score of 600.

It looks like there isn’t much variation in Borrower Rate by credit Score for those borrowers with a credit score of less than 600.

I will define bad credit as having a credit score of less than 600 and treat these loans as a separate category.

LoanOriginalAmount

LoanOriginalAmount by quarter

There is at least some relationship between LoanOriginalAmount and BorrowerRate.

All loans in the earlier period are 3 year loans.

LoanOriginalAmount by quarter and credit grade

The correlation between the borrower rate and the loan amount becomes higher as the CreditGrade goes up…

LoanOriginalAmount by quarter and ProsperRating for 3 year loans

It looks like loan original amount was used to determine the BorrowerRate up until 2011.

Lets look at this in a bit more detail…

The correlation between loan amount and borrower rate also goes up as prosper rating goes up.

1 and 5 year loands

1 year loans were only offered as of Q4 2010, so the loan original amount clearly does not effect the borrower rate for any such loans.

the same goes for the 5 year loans…

Income Level

The median borrower rate is related to IncomeRange.

The trend towards lower BorrowerRate for higher StatedMonthlyIncome is consistent accross all of the quarters.

StatedMonthlyIncome

CreditGrade

#####ProsperRating

StatedMonthlyIncome Varies significantly more accross different ProsperRatings than it does accross different CrediGrades.

Debt

DebtToIncomeRatio

The trend towards higher BorrowerRate for Higher DebtToIncomeRatio is consistent accross all of the quarters.

AvailableMonthlyIncome

(1-DebtToIncomeRatio)*StatedMonthlyIncome

The quantile Of Available Monthly income is correlated to the median Borrower Rate.

Linear Models

(Data is subsetted to exclude borrowers with a credit score of 600 or less)

Model with CreditGrade and LoanOriginalAmount

## 
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount + (CreditScoreRangeLower))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.216688 -0.029347 -0.009552  0.017831  0.216400 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.027e-01  4.152e-03  145.16   <2e-16 ***
## LoanOriginalAmount     2.373e-06  5.662e-08   41.91   <2e-16 ***
## CreditScoreRangeLower -6.748e-04  6.092e-06 -110.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0465 on 19034 degrees of freedom
## Multiple R-squared:  0.3978, Adjusted R-squared:  0.3977 
## F-statistic:  6286 on 2 and 19034 DF,  p-value: < 2.2e-16

Model with log10(CreditGrade) and LoanOriginalAmount

## 
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount + log10(CreditScoreRangeLower - 
##     600))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.206171 -0.025775 -0.008788  0.016537  0.221467 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         3.680e-01  2.026e-03   181.6   <2e-16
## LoanOriginalAmount                  2.485e-06  5.699e-08    43.6   <2e-16
## log10(CreditScoreRangeLower - 600) -1.259e-01  1.110e-03  -113.4   <2e-16
##                                       
## (Intercept)                        ***
## LoanOriginalAmount                 ***
## log10(CreditScoreRangeLower - 600) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04479 on 17981 degrees of freedom
## Multiple R-squared:  0.4215, Adjusted R-squared:  0.4214 
## F-statistic:  6550 on 2 and 17981 DF,  p-value: < 2.2e-16

Plots of model w/ color gradeint

IncomeRange

StatedMontlyIncome

AvailableMonthlyIncome

DebtToIncomeRatio

Model with log10(CreditGrade), LoanOriginalAmount and DebtToIncomeRatio

## 
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount + log10(CreditScoreRangeLower - 
##     600) + DebtToIncomeRatio)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.221211 -0.024666 -0.008269  0.015813  0.222656 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         3.562e-01  2.066e-03  172.38   <2e-16
## LoanOriginalAmount                  2.304e-06  5.683e-08   40.55   <2e-16
## log10(CreditScoreRangeLower - 600) -1.232e-01  1.101e-03 -111.84   <2e-16
## DebtToIncomeRatio                   3.146e-02  1.438e-03   21.88   <2e-16
##                                       
## (Intercept)                        ***
## LoanOriginalAmount                 ***
## log10(CreditScoreRangeLower - 600) ***
## DebtToIncomeRatio                  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04401 on 17871 degrees of freedom
## Multiple R-squared:  0.439,  Adjusted R-squared:  0.4389 
## F-statistic:  4662 on 3 and 17871 DF,  p-value: < 2.2e-16

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The average loan amount is closely correlated with the number of loans given out.

Credit score is correlated to Borrower Rate, although that correlation changes by quarter.

As of 2009, there is a strict cutoff at a credit score of 600.

There isn’t much variation in Borrower Rate by credit Score for those borrowers with a credit score of less than 600.

The correlation between the borrower rate and the loan amount becomes higher as the CreditGrade goes up.

The correlation between the borrower rate and the loan amount becomes higher as the Prosper Rating goes up.

There is a correlation between loan amount and borrower rate exists up until 2011.

The median borrower rate is related to IncomeRange.

The trend towards lower BorrowerRate for higher StatedMonthlyIncome is consistent accross all of the quarters.

The trend towards higher BorrowerRate for Higher DebtToIncomeRatio is consistent accross all of the quarters.

The quantile Of Available Monthly income is correlated to the median Borrower Rate.

Were there any interesting or surprising interactions between features?

The inverse of the BorrowerRate follows the trends in the number of loans and LoanOriginalAmount by about 2 years.

CreditGrades were only used up until 2009, after which ProsperRatings were primarily used to determine BorrowerRate. ProsperRatings Are much less dependent on CreditScore.

All loans in the earlier period are 3 year loans and 1 and 5 year loans are only offered as of 2011 (after LoanOriginalAmount was no longer used to determine BorrowerRate).

StatedMonthlyIncome Varies significantly more accross different ProsperRatings than it does accross different CrediGrades.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I creted the following linear models, in order.

The models apply to a subset of the data taken from 2006 to 2009 with a CreditScoreRangeLower of 600 or less.

Using CreditScoreRangeLower and LoanOriginalAmount

R^2: 0.3977

Using log10(CreditScoreRangeLower), LoanOriginalAmount and DebtToIncomeRatio

R^2: 0.4214

Using CreditScoreRangeLower and LoanOriginalAmount

R^2: 0.4505

(For the final model, the data is further subsetted to exclude DebtToIncomeRatios above 0.5)

The best model accounts for 45% of the variation in the Borrower Rate.

Final Plots and Summary

Plot One

Description One

The average loan amount is closely correlated with the number of loans given out, while, the inverse of the BorrowerRate follows the trends in the average loan amount and number of loans by about 2 years.

Plot Two

Description Two

In General, Higher Paying Occupations with a higher level of education (i.e. Judge, Computer Programmer Engineer) have a lower median borrower rate than lower paying occupations with lower levels of education (i.e. Teacher’s Aide, Nurse’s Aide, College Freshman).

Plot Three

Description Three

Variation in the BorrowerRate from 2006 to 2009 is explained primarily by the LoanOriginalAmount and the log transform of the Credit Score for the borrowers with a credit score of over 600.


Reflection

This prosper loan data is a rich dataset with multiple inerrelated variables. It is also relatively sophisticate with many changes over time over the course of the data collection.

In my investigation of the BorrowerRate, I was able to find some interesting trends, and was able to create a linear model a sebset of the data with high variance. However, I was not able to fully understand all of the factors influencing the BorrowerRate in this investigation.

While it is reasonable to expect that there would be natural variation in the data, it is likely that a loan company such as this one would use a rigorous method to calculate risk separately for each borrower, and distribute the borrower rates accordingly. If this is the case, reverse engineering the method used for the BorrowerRate would require a more thorough investigation of all of the data available in the dataset.